tugHall: USER-GUIDE

2019-09-30

tugHall (tumor gene-Hallmark) is a simulator of a cancer-cell evolution model, wherein gene mutations are linked to the tumor cell behaviors that are influenced by the hallmarks of cancer.

This is a script in R to simulate the cancer cell evolution in the framework of the model proposed by prof. Mamoru Kato, Head of Bioinformatics Department, Research Institute, Nation Cancer Center, Tokyo, JAPAN.


Authors and contributor list:

Iurii Nagornov

Mamoru Kato

Department of Bioinformatics, Research Institute, National Cancer Center Japan, Tokyo, Japan / https://www.ncc.go.jp/en/ri/department/bioinformatics/index.html

All other known bugs and fixes can be sent to inagonov@ncc.go.jp


Abstract

The flood of recent cancer genomic data requires a coherent model that can sort out the findings to systematically explain clonal evolution and the resultant intra-tumor heterogeneity (ITH). Here, we present a new mathematical model designed to computationally simulate the evolution of cancer cells. The model connects the well-known hallmarks of cancer with the specific mutational states of tumor-related genes. The cell behavior phenotypes are stochastically determined and the hallmarks probabilistically interfere with the phenotypic probabilities. In turn, the hallmark variables depend on the mutational states of tumor-related genes. Thus, our software can deepen our understanding of cancer-cell evolution and generation of ITH.


Licence Information

GNU GENERAL PUBLIC LICENSE Version 3 from 29 June 2007


Project source can be downloaded from website:

https://github.com/nagornovys/Cancer_cell_evolution


Input parameters

Hallmarks variables

Trial probabilities and hallmark variables are listed in the model description. The variable of hallmark \(x\), \(H_x\), is calculated as a linear combination of the weights, \(w^x_i\), and the gene indicator variables, \(g^x_i\), as follows:

\[ H_x = \sum^{n_x}_{i=1}{w^x_i \cdot g^x_i}, \]

\[ \textrm{such that}~~~~ \sum^{n_x}_{i=1}{w^x_i}=1,~~~~ 0 \leq w^x_i \leq 1 \label{eq:02} \]

\[ g^x_i = \left\{ \begin{array}{ll} 1, & \textrm{when the gene is impared}\\ 0, & \textrm{otherwise,} \end{array} \right. \] where \(n_x\) represents the number of genes related to hallmark \(x\). If effective trial probabilities go over \(1\) or under \(0\) due to hallmark variables, the probability values are set to \(1\) or \(0\), respectively. This modeling by linear combination is provided for simple interpretability because addition of probabilities is intuitively easy to understand.

There is an input file to define the halmarks variables and weights (first 10 lines):

Table 1. Input file for genes. Example of input file for hallmarks and weights
Genes length CDS Hallmark Suppressor or Oncogene Weights
APC 8532 apoptosis s 0.1111111
APC 8532 growth s 0.1666667
APC 8532 invasion s 0.2857143
KRAS 567 apoptosis o 0.3333333
KRAS 567 growth o 0.3333333
KRAS 567 immortalization o 0.5714286
KRAS 567 angiogenesis o 0.4285714
KRAS 567 invasion o 0.1428571
TP53 1182 apoptosis s 0.1111111
TP53 1182 growth s 0.1666667
  1. Genes. # eg, TP53, KRAS - name of gene.

  2. length CDS # eg, 2724, 10804 - length of CDS for each gene.

  3. Hallmark. # eg, apoptosis - name of hallmark.

  4. Suppressor or oncogene. # eg, s, o - suppressor (s) or oncogene (o).

  5. Weights. # eg, 0.333, 0.5 - weights for hallmark-genes relations.


After that the program defines all weights, and all unknown weights equal 0. Note that program makes normalization, so sum of all weights should be equal 1 (first 10 lines):

Table 2. Weights for hallmarks. Example of weights for hallmarks and genes from input table. Unknown values are equal 0.
Genes Apoptosis, \(H_a\) Angiogenesis, \(H_b\) Growth / Anti-growth, \(H_d\) Immortalization, \(H_i\) Invasion / Metastasis, \(H_{im}\)
APC 0.1590909 0.0000000 0.1666667 0.0000000 0.2195122
KRAS 0.4772727 0.4285714 0.3333333 0.5714286 0.1097561
TP53 0.1590909 0.4285714 0.1666667 0.4285714 0.3292683
PIK3CA 0.2045455 0.1428571 0.3333333 0.0000000 0.3414634
  1. Genes # eg, TP53, KRAS - name of gene.

  2. Apoptosis, \(H_a\) # eg, 0.333, 0.5 - weights of hallmark “Apoptosis” with genes.

  3. Angiogenesis, \(H_b\) # eg, 0.333, 0.5 - weights of hallmark “Angiogenesis” with genes.

  4. Growth / Anti-growth, \(H_d\) # eg, 0.333, 0.5 - weights of hallmark “Growth / Anti-growth” with genes.

  5. Immortalization, \(H_i\) # eg, 0.333, 0.5 - weights of hallmark “Immortalization” with genes.

  6. Invasion / Metastasis, \(H_{im}\) # eg, 0.333, 0.5 - weights of hallmark “Invasion / Metastasis” with genes.


Probabilities of processes

The all probabilities and the functional dependences are defined in the model description. We here explain the trials and hallmarks:


Parameter values

We will explain the possible parameter values:

The other parameters:

Input the probabilities

The input of probabilities is possible in the code “tugHall.R”:

E0 <- 2E-4 # parameter in the division probability
F0 <- 1E0 # parameter in the division probability
m <- 1E-6 # mutation probability
uo <- 0.5 # oncogene mutation probability
us <- 0.5 # suppressor mutation probability
s <- 10 # parameter in the sigmoid function
k <- 0.1 # Environmental death probability


Also in the code we can define names of input and output files, and additional parameters of simulation:

genefile <- ‘gene_cds2.txt’ # file with information about weights
cellfile <- ‘cellinit.txt’ # initial Cells
geneoutfile <- ‘geneout.txt’ # Gene Out file with Hallmarks
celloutfile <- ‘cellout.txt’ # output information of simulation
logoutfile <- ‘log.txt’ # log file to save the input information of simulation
censore_n <- 30000 # Max cell number where the program forcibly stops
censore_t <- 200 # Max time where the program forcibly stops


Structure of files and directories

In the root directory:

User-Guide.Rmd - the user guide in Rmd format.

User-Guide.html - the user guide in html format.

dir ./tigHall/ - the directory with the program.


In the ./tigHall/ directory:

tugHall.R - program to run a simulation and define the parameters.

dir /Code/ - the folder with the code and the library of functions.

dir /Input/ - the folder with the input files.

dir /Output/ - the folder with the output files.

dir /Figures/ - the folder with figures of the plots.


In the /Code/ directory:

CanSim.bib, pic_lic.jpg - the necessesary files for user guide.

tugHall_functions.R - the file with the functions for a simulation / core of program.

Analysis.R - the file to analyse the results of a simulation and plot figures.

Functions.R - the file with the functions for the analysis of results.


In the /Input/ directory:

cellinit.txt - the file with a list of the initial cells with/without destroyed genes.

gene_cds2.txt - the file with the halmarks variables and the weights in the format of Table.1.


In the /Output/ directory:

cellout.txt - the file with output of simulation in the format of Table.3 (see below).

geneout.txt - the file with information about halmarks variables and the weights.

log.txt - the file with information about all parameters.

Order_of_dysfunction.txt - the file with an information about order of gene dysfunction during evolution in the format of Table.4 (see below).

Weights.txt - the file with information about weights between hallmarks and genes in the format of Table.2.

VAF.txt - information about variant allel frequency (VAF) for each gene and each site in the genes in the format of Table.5 (see below).


In the /Figures/ directory there are 13 figures of the *.jpg format, which appear after analysis of simulation. The examples and the descriptions are below in the section “Output data after the simulation”.

How to RUN the simulation

In order to make the simulation, please, follow next procedure:

  1. Copy /tugHall/ directory into the working directory.

  2. CD to the /tugHall/ directory.

  3. Run tugHall.R file, where:
  1. To make report of simulation, please, rename and change “User-Guide.RMD”, which uses the output and input files in /Input/, /Output/, /Figures/ directories.

Output data after the simulation

Output data content several files and many figures. The “log.txt” and “geneout.txt” files have copied input information about variables and gene names. “Weights.txt” has information about weights of genes for hallmarks. “Cellout.txt” has information about dynamics of cell evolution and all variabls (see below). “Order_of_dysfunction.txt” has information about order of gene dysfunction during evolution (see below example). “VAF.txt” file has information about variant allel frequency for each gene and each site in the genes, which were mutated (see description below). And finally, the folder “Figures/” has many plots (see examples below).

Contents of “cellout.txt” file

File “cellout.txt” contents all output data for each cell at each timestep (first 10 lines):

Table 5. Output data. Example of output data for all cells. The names of columns are related the description in the Tables 1,2 and Figures 2,3
Time AvgOrIndx ID ParentID.Birthday c. d. i. im. a. k. E. N Nmax. M Ha Him Hi Hd Hb type mut_den PosDriver.APC PosDriver.KRAS PosDriver.TP53 PosDriver.PIK3CA PosPasngr.APC PosPasngr.KRAS PosPasngr.TP53 PosPasngr.PIK3CA Clone.number Passengers.Clone.number Mix.Clone.number
1 avg - - 0 1 1 0 0.0066929 0.1 5e-04 1000 2000 0 0 0 0 0 0 0 0 0 0 0 0 - - -
1 1 1 0:0 0 1 1 0 0.0066929 0.1 5e-04 1000 2000 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0:0 0 1 1 0 0.0066929 0.1 5e-04 1000 2000 0 0 0 0 0 0 0 0 0 0 0
1 3 3 0:0 0 1 1 0 0.0066929 0.1 5e-04 1000 2000 0 0 0 0 0 0 0 0 0 0 0
1 4 4 0:0 0 1 1 0 0.0066929 0.1 5e-04 1000 2000 0 0 0 0 0 0 0 0 0 0 0
1 5 5 0:0 0 1 1 0 0.0066929 0.1 5e-04 1000 2000 0 0 0 0 0 0 0 0 0 0 0
1 6 6 0:0 0 1 1 0 0.0066929 0.1 5e-04 1000 2000 0 0 0 0 0 0 0 0 0 0 0
1 7 7 0:0 0 1 1 0 0.0066929 0.1 5e-04 1000 2000 0 0 0 0 0 0 0 0 0 0 0
1 8 8 0:0 0 1 1 0 0.0066929 0.1 5e-04 1000 2000 0 0 0 0 0 0 0 0 0 0 0
1 9 9 0:0 0 1 1 0 0.0066929 0.1 5e-04 1000 2000 0 0 0 0 0 0 0 0 0 0 0
  1. Time # eg, 1, 50 - the time step.
  2. AvgOrIndx # eg, avg, 4,7 - “avg” is for lines with averaging values, index like 1,4,5 etc. is for cells index at the current time step.
  3. ID # eg, 1, 50 - the unique ID of cell.
  4. ParentID.Birthday # eg, 0:0, 45:5 - the first number is parent ID, the second number is birthday time step.
  5. c # eg, 0, 7 - the counter of cell divisions.
  6. d # eg, 0.1, 0.8 - the value of probability of division for the cell.
  7. i # eg, 0.1, 0.8 - the value of probability of immortalization for the cell.
  8. im # eg, 0.1, 0.8 - the value of probability of invasion/metastasis for the cell.
  9. a # eg, 0.1, 0.8 - the value of probability of apoptosis for the cell.
  10. k # eg, 0.1, 0.8 - the value of probability of death due to the environment for the cell.
  11. E # eg, 10^4, 10^5 - the coefficient for the function of the division probability.
  12. N # eg, 134, 5432 - the number of normal cells at this time step.
  13. Nmax # eg, 10000, 5000 - the maximal number of normal cells for the cell.
  14. M # eg, 16, 15439 - the number of metastasis cells at this time step.
  15. Ha # eg, 0.1, 0.4444 - the value of the hallmar “Apoptosis” for the cell.
  16. Him # eg, 0.1, 0.4444 - the value of the hallmar “Invasion / Metastasis” for the cell.
  17. Hi # eg, 0.1, 0.4444 - the value of the hallmar “Immortalization” for the cell.
  18. Hd # eg, 0.1, 0.4444 - the value of the hallmar “Growth / Anti-growth” for the cell.
  19. Hb # eg, 0.1, 0.4444 - the value of the hallmar “Angiogenesis” for the cell.
  20. type # eg, 0, 1 - the type of the cell: “0” is normal cell, “1” is metastasis cell.
  21. mut_den # eg, 0, 0.32 - the density of the mutation for the cell.
  22. PosDriver.(Gene_1=“APC”) # eg, 3493:4, 4531:34 - the first number is the site and the second number is the time step of mutation for the gene with driver mutation.
  23. PosDriver.(Gene_2=“KRAS”) # eg, 3493:4, 4531:34 - the first number is the site and the second number is the time step of mutation for the gene with driver mutation.
  24. PosDriver.(Gene_…) # eg, 3493:4, 4531:34 - the first number is the site and the second number is the time step of mutation for the gene with driver mutation.
  25. PosDriver.(Gene_last=“PIK3CA”) # eg, 1, 3493:4, 4531:34 - the first number is the site and the second number is the time step of mutation for the gene with driver mutation.
  26. PosPassngr.(Gene_1=“APC”) # eg, 933:8, 3825:24 - the first number is the site and the second number is the time step of mutation for the gene with passenger mutation.
  27. PosPassngr.(Gene_2=“KRAS”) # eg, 933:8, 3825:24 - the first number is the site and the second number is the time step of mutation for the gene with passenger mutation.
  28. PosPassngr.(Gene_…) # eg, 933:8, 3825:24 - the first number is the site and the second number is the time step of mutation for the gene with passenger mutation.
  29. PosPassngr.(Gene_last=“PIK3CA”) # eg, 933:8, 3825:24 - the first number is the site and the second number is the time step of mutation for the gene with passenger mutation.
  30. Clone.number # eg, 15, 4 - the clone number is calculated from binary code of mutated genes. For example, if gene is mutated then its value in binary code equal 1, and if no, it is 0. For example, the cells have only 4 genes in simulation, so “Clone.number” can have binary numbers from 0000 to 1111, that is related decimal numbers from 0 to 15.
  31. Passengers.Clone.number # eg, 15, 4 - same as for “Clone.number”, but for passenger mutations.
  32. Mix.Clone.number # eg, 35, 16 - same as for “Clone.number”, but for passenger and driver mutations together.

“Order_of_dysfunction.txt” file

“Order_of_dysfunction.txt” has information about order of gene dysfunction during evolution in the next format (first 10 lines):

Table 6. Order of gene dysfunction.
Order of gene dysfunction Number of cells
APC TP53 PIK3CA KRAS 41006
1720
PIK3CA TP53 KRAS APC 396
APC TP53 PIK3CA 20
PIK3CA TP53 KRAS 20
APC 13
PIK3CA 10
TP53 PIK3CA APC 4
TP53 2
KRAS 1
  1. Order of gene dysfunction. Order of gene dysfunctionis is the list of gene’s names in the order of mutations - from first to last. Blank line is related to the normal cells.
  2. Number of cells. The number of cell is number of cells in the pool with same order.

“VAF.txt” file

“VAF.txt” file has information about variant allel frequency (VAF) for each gene and each site in the genes, which were mutated (first 10 lines):

Table 7. Variant allel frequency.
DriverPasngr Gene Position VAF_Primary Ncells_Primary_wMutation Ncells_Primary VAF_Metastatic Ncells_Metastatic_wMutation Ncells_Metastatic VAF_PriMet Ncells_PriMet_wMutation Ncells_PriMet
D APC 3 0 0 1793 0.0000845 7 41400 0.0000810 7 43193
D APC 7 0 0 1793 0.0001449 12 41400 0.0001389 12 43193
D APC 8 0 0 1793 0.0000242 2 41400 0.0000232 2 43193
D APC 9 0 0 1793 0.0000242 2 41400 0.0000232 2 43193
D APC 10 0 0 1793 0.0000121 1 41400 0.0000116 1 43193
D APC 15 0 0 1793 0.0000121 1 41400 0.0000116 1 43193
D APC 19 0 0 1793 0.0000121 1 41400 0.0000116 1 43193
D APC 20 0 0 1793 0.0021981 182 41400 0.0021068 182 43193
D APC 22 0 0 1793 0.0001691 14 41400 0.0001621 14 43193
D APC 23 0 0 1793 0.0000121 1 41400 0.0000116 1 43193
  1. DriverPasngr. # D or P to indicate distinction between Driver and Passenger.

  2. Gene. # eg, TP53, KRAS - name of gene.

  3. Position. # eg, 123, 1000 - position at mutated site in gene.

  4. VAF_Primary. # eg, 0.2 - VAF for Primary cells = half of N_mutated_Primary / N_Primary cells

  5. Ncells_Primary_wMutation. # eg, 40 - number of Primary mutated cells N_mutated_Primary

  6. Ncells_Primary. # eg, 100 - number of Primary cells

  7. VAF_Metastatic. # VAF for Metastatic cells = half of N_mutated_Metastatic / N_Metastatic

  8. Ncells_Metastatic_wMutation. # number of Metastatic mutated cells N_mutated_Metastatic

  9. Ncells_Metastatic. # number of Metastatic cells N_Metastatic

  10. VAF_PriMet. # VAF for all cells = half of N_mutated_cells / N_all

  11. Ncells_PriMet_wMutation. # number of all mutated cells

  12. Ncells_PriMet. # number of all cells


“Figures/” folder and output plots

The directory “Figures/” contents many output figures, generated during analysis process of “cellout.txt” file, including evolutions of number of normal and metastasys cells (Fig.4A), hallmarks (Fig.4B) and probabilities (Fig.5A).

Fig.4. Results of simulation: left - evolution of number of cells, right - evolution of hallmarks.


Fig.5. Results of simulation: left - evolution of probabilities, right - evolution of number of clones.


Fig.5B shows the evolution of number of clones. Here we have to define clone in simulation is pool of cells with a same set of mutated genes for driver only or for all mutations (driver and passenger). For this propose we define clone ID as bunary number of mutated genes, please, see “cellout.txt” file for columns “Clone.number”, “Passengers.Clone.number” and “Mix.Clone.number”. Also analysis can calculate the evolution of number of cells in each clone:

Fig.6. Results of simulation: left - evolution of number of cells in clones, right - same plot with scale out.


Fig.7 shows number of cells in each clone at last time step to see which clone is dominant and prevails above others:

Fig.7. Results of simulation: left - barplot for number of cells in clones for driver mutated clone, right - same plot for all mutated cells.


During the simulation each cell has “ID of clone”, which calculated from binary code of mutated genes. For example, if gene is mutated then its value in binary code equal 1, and if no, it is 0. For example, the cells have only 4 genes in simulation, so “ID of clone” can have binary numbers from 0000 to 1111, that is related decimal numbers from 0 to 15. Each cell has information about parent ID and time of birthday, so it is possible to calculate cells with same “ID of clone”, after that find cell with earlest birthday and get parent ID of it. Using this parent ID to find “ID of clone” for parent, which is related parent of clone. Follow this procedure we find all relations between parent and children for clones and construct the tree for clones (cells) at last time step (Fig.8).

Fig.8. Tree of clones at last time step. The numbers of tree indicate the ID of clones barplot for number of cells in clones for driver mutated clone.


Figs.9 and 10 shows inequality coefficients (for driver mutated cells and for any type mutation) as function of:

Fig.9. Results of simulation of inequality coefficient for driver mutated cells and for cells with any type mutations: left - evolution of inequality coefficient, right - inequality coefficient as function of all cells.

Fig.10. Results of simulation of inequality coefficient for driver mutated cells and for cells with any type mutations: left - inequality coefficient as function of normal cells, right - inequality coefficient as function of metastasis cells.


Funding

This work was supported by CREST-JST (14531766); MEXT (15K06916); and AMED (16ck0106013h0003).

Acknowledgements

We thank Asmaa Elzawahry, Yusuke Suenaga, Sana Yokoi, Yoshitaka Hippo, Atsushi Niida, and Daniel A. Vasco for useful suggestions.


References

Abbott, R. G., S. Forrest, and K. J. Pienta. 2006. “Simulating the Hallmarks of Cancer.” Journal Article. Artif Life 12 (4): 617–34. doi:10.1162/artl.2006.12.4.617.

Basanta, D., B. Ribba, E. Watkin, B. You, and A. Deutsch. 2011. “Computational Analysis of the Influence of the Microenvironment on Carcinogenesis.” Journal Article. Math Biosci 229 (1): 22–29. doi:10.1016/j.mbs.2010.10.005.

Friberg, S., and S. Mattson. 1997. “On the Growth Rates of Human Malignant Tumors: Implications for Medical Decision Making.” Journal Article. J Surg Oncol 65 (4): 284–97. https://www.ncbi.nlm.nih.gov/pubmed/9274795.

Hanahan, D., and R. A. Weinberg. 2000. “The Hallmarks of Cancer.” Journal Article. Cell 100 (1): 57–70. http://www.ncbi.nlm.nih.gov/pubmed/10647931.

Monteagudo, A., and J. Santos. 2014. “Studying the Capability of Different Cancer Hallmarks to Initiate Tumor Growth Using a Cellular Automaton Simulation. Application in a Cancer Stem Cell Context.” Journal Article. Biosystems 115: 46–58. doi:10.1016/j.biosystems.2013.11.001.

Preston, B. D., T. M. Albertson, and A. J. Herr. 2010. “DNA Replication Fidelity and Cancer.” Journal Article. Semin Cancer Biol 20 (5): 281–93. doi:10.1016/j.semcancer.2010.10.009.

Spencer, S. L., R. A. Gerety, K. J. Pienta, and S. Forrest. 2006. “Modeling Somatic Evolution in Tumorigenesis.” Journal Article. PLoS Comput Biol 2 (8): e108. doi:10.1371/journal.pcbi.0020108.

Vogelstein, B., N. Papadopoulos, V. E. Velculescu, S. Zhou, Jr. Diaz L. A., and K. W. Kinzler. 2013. “Cancer Genome Landscapes.” Journal Article. Science 339 (6127): 1546–58. doi:10.1126/science.1235122.

Wang, Y., J. Waters, M. L. Leung, A. Unruh, W. Roh, X. Shi, K. Chen, et al. 2014. “Clonal Evolution in Breast Cancer Revealed by Single Nucleus Genome Sequencing.” Journal Article. Nature 512 (7513): 155–60. doi:10.1038/nature13600.

Watson, I. R., K. Takahashi, P. A. Futreal, and L. Chin. 2013. “Emerging Patterns of Somatic Mutations in Cancer.” Journal Article. Nat Rev Genet 14 (10): 703–18. doi:10.1038/nrg3539.

Weinberg, Robert. 2013. The Biology of Cancer. Book. Garland science.

Williams, M. J., B. Werner, T. Heide, C. Curtis, C. P. Barnes, A. Sottoriva, and T. A. Graham. 2018. “Quantification of Subclonal Selection in Cancer from Bulk Sequencing Data.” Journal Article. Nat Genet 50 (6): 895–903. doi:10.1038/s41588-018-0128-6.